Statistical Techniques for Text Classification Based on Word Recurrence Intervals

Authors

M. J. BERRYMAN

A. ALLISON

Abstract

Deciding whether two texts were written by the same author is usually difficult. Can an analysis of how the words in a text statistically cluster shed some light on authorship? In this paper we examine both English texts and the Greek source texts of the New Testament. The mathematical techniques developed by Shannon [1,2] and Markov have been used for a number of years to analyse sequences of data, whether computer code, text, or DNA. These and other probability-based techniques have been widely used in analysing DNA sequences [3] as well as both written and spoken text [4,5]. Applications of linguistic methods to DNA sequence analysis have been explored by Dong and Searls [6] and others, and this is our motivation for exploring linguistic techniques for authorship (the corresponding problem in the field of DNA research is the phylogeny of organisms based on their DNA sequences). A seminal work in the area of authorship is Mosteller [7]; a good overview of other work can be found in Oakes [8]. Durbin et al. [9] is a good reference for work done in analysing DNA sequences. Ortuño et al. [10] suggest using the standard deviation of the ‘inter-word spacing’
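The abstract is cut off at this point, so the exact statistic used in the paper is not visible here. As a rough illustration only, the sketch below computes the standard deviation of the gaps between successive occurrences of each word, which is one plausible reading of the ‘inter-word spacing’ measure. The function names, the regular-expression tokenizer, the `min_occurrences` threshold, and the choice of the population standard deviation are all assumptions made for this example, not details taken from the paper.

```python
import re
import statistics
from collections import defaultdict


def word_positions(text):
    """Map each lowercased word to the token indices at which it occurs."""
    positions = defaultdict(list)
    # \w+ is Unicode-aware in Python 3, so Greek text is tokenized as well.
    for index, token in enumerate(re.findall(r"\w+", text.lower())):
        positions[token].append(index)
    return positions


def spacing_std(text, min_occurrences=3):
    """Return {word: standard deviation of gaps between successive occurrences}.

    Words seen fewer than `min_occurrences` times are skipped (an assumed
    threshold), since they yield too few gaps to be informative.
    """
    result = {}
    for word, indices in word_positions(text).items():
        if len(indices) < min_occurrences:
            continue
        # Gap = distance in tokens between one occurrence and the next.
        gaps = [later - earlier for earlier, later in zip(indices, indices[1:])]
        result[word] = statistics.pstdev(gaps)
    return result


sample = "the cat saw the dog and the bird saw the cat"
scores = spacing_std(sample)
```

In this toy sample only "the" occurs often enough to be scored, and its occurrences are evenly spaced, so its spacing shows no spread. A word that clusters in some passages and is absent from others would give a larger value, which is the kind of signal the abstract alludes to.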


Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...


Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...


Improving the Quality of Text Classification Using a Two-Level Classifier Committee

Nowadays, the automated text classification has witnessed special importance due to the increasing availability of documents in digital form and ensuing need to organize them. Although this problem is in the Information Retrieval (IR) field, the dominant approach is based on machine learning techniques. Approaches based on classifier committees have shown a better performance than the others. I...




Publication date: 2003
